Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
cuda_core/pixi.toml:67
Removing the `cu12` environment from this subproject can break the repository's top-level `pixi run -e cu12 test` workflow, which runs `pixi run --manifest-path cuda_core test` under the propagated `PIXI_ENVIRONMENT_NAME=cu12`. If cu12 testing is still expected at the workspace level, consider keeping a solvable `cu12` environment here (e.g., using conda-forge `cuda-bindings`/`cuda-version` constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.
# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
"cu13",
"test",
"cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }
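The alternative the review suggests could look roughly like the following hypothetical pixi.toml fragment; the exact constraint values and solve-group layout are assumptions, not from the PR:

```toml
# Hypothetical sketch: a solvable cu12 environment pinned to conda-forge
# packages instead of the unsolvable path dependency on ../cuda_bindings.
[feature.cu12.dependencies]
cuda-version = "12.*"
cuda-bindings = "*"

[environments]
cu12 = { features = ["cu12", "test", "cython-tests"], solve-group = "cu12" }
```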
cuda_core/cuda/core/_tensor_map.pyx:461
`c_pixel_box_lower`/`c_pixel_box_upper` are declared as fixed-size `int[3]` but only the first `n_spatial` entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
cdef uint64_t[5] c_global_dim
cdef uint64_t[4] c_global_strides
cdef uint32_t[5] c_element_strides
cdef int[3] c_pixel_box_lower # max 3 spatial dims (rank 5 - 2)
cdef int[3] c_pixel_box_upper
cdef int i_c
for i_c in range(rank):
c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]
for i_c in range(rank - 1):
c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]
# Reverse spatial dimensions for lower/upper corners
for i_c in range(n_spatial):
c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
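The suggested fix can be modeled outside Cython. This is a plain-Python sketch of the loop above (names are illustrative, not from the PR): zero-fill every slot first so unwritten entries are deterministic, then write the active entries innermost-first.

```python
def pack_pixel_box(corner, n_spatial, max_dims=3):
    # Zero-initialize ALL max_dims entries, not just the first n_spatial,
    # so anything the driver reads beyond n_spatial is deterministic.
    packed = [0] * max_dims
    # Reverse spatial dimensions, mirroring the Cython loop above.
    for i in range(n_spatial):
        packed[i] = int(corner[n_spatial - 1 - i])
    return packed

# 2 spatial dims: the third slot stays 0 instead of holding garbage.
print(pack_pixel_box((5, 7), 2))  # → [7, 5, 0]
```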
view = _get_validated_view(tensor)
desc._source_ref = tensor
TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.
/ok to test
leofang left a comment
There is a coordinated effort between C++ and Python: #199 (comment). Can we please look into reusing the C++ implementation (mainly because @fbusato is a TMA expert) and avoid re-implementing it if possible?
Fighting with poor documentation and bugs doesn't make me an expert :).
The implementation of
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
cuda_core/tests/test_tensor_map.py:102
This test passes a raw `Buffer` from `dev.allocate()` with `data_type=FLOAT32`. `Buffer` exports via DLPack as an int8 tensor with `shape=(n_bytes,)`, so the TMA encoder will treat `shape[0]` as a float32 element count unless the implementation compensates for this. That can create a descriptor that covers 4× more memory than the allocation and hide potential out-of-bounds issues. Prefer wrapping the buffer in `_DeviceArray(buf, (1024,), dtype=np.float32)` (or `StridedMemoryView.from_buffer` with the intended shape/dtype) so the descriptor is built from element-count dimensions matching the data type.
buf = dev.allocate(1024 * 4) # 1024 float32 elements
desc = TensorMapDescriptor.from_tiled(
buf,
box_dim=(64,),
data_type=TensorMapDataType.FLOAT32,
)
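The byte-count vs. element-count mismatch described above is easy to see with a byte-level view in plain Python (no GPU required); `memoryview.cast` plays the role of re-typing the DLPack export:

```python
# A raw 4096-byte allocation, as Buffer's DLPack export would describe
# it: an int8 tensor of shape (4096,).
raw = memoryview(bytes(1024 * 4))
assert raw.nbytes == 4096

# If the encoder treats shape[0] (4096) as a float32 ELEMENT count, the
# descriptor spans 4096 * 4 = 16384 bytes -- 4x the real allocation.
assert len(raw) * 4 == 16384

# Re-typing with the intended dtype first yields the correct extent.
typed = raw.cast('f')  # float32 view
assert len(typed) == 1024
assert typed.nbytes == 4096
```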
cuda_core/tests/test_tensor_map.py:277
Same issue as `test_from_tiled_1d`: building a descriptor from a raw `Buffer` with `data_type=FLOAT32` relies on the implementation translating the buffer's byte length into a float32 element count. To avoid encoding a descriptor with an incorrect `global_dim`, wrap `buf1`/`buf2` in `_DeviceArray(..., dtype=np.float32)` (or a `StridedMemoryView` with the intended dtype/shape) before calling `from_tiled()`/`replace_address()`.
def test_replace_address(self, dev, skip_if_no_tma):
buf1 = dev.allocate(1024 * 4)
desc = TensorMapDescriptor.from_tiled(
buf1,
box_dim=(64,),
data_type=TensorMapDataType.FLOAT32,
)
cuda_core/cuda/core/_kernel_arg_handler.pyx:305
Support for passing `TensorMapDescriptor` as a kernel argument is added here, but there's no test exercising the full path (ParamHolder → cuLaunchKernel) with a real `TensorMapDescriptor` argument. Given `cuda_core/tests/test_launcher.py` already validates scalar/buffer argument handling, consider adding a small integration test that launches a kernel taking a `CUtensorMap` by value and verifies it can be consumed (or at least that the kernel receives the expected 128-byte payload). This will protect against ABI/size/alignment regressions in the argument marshalling logic.
elif arg_type is tensor_map_descriptor_type:
prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
continue
elif arg_type is bool:
prepare_arg[cpp_bool](self.data, self.data_addresses, arg, i)
…time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests - Add test for rank > 5 (6D tensor) to verify upper bound validation - Add NULL check for PyMem_Malloc in prepare_tensor_map_arg Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the replace_address() demonstration into its own self-contained example (tma_replace_address.py) so each file covers a single concept. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…alidated views alive to avoid DLPack-backed pointer lifetime hazards. Add explicit tiled element-stride coverage and acknowledge the DLPack include-layout compatibility follow-up in NVIDIA/cccl#7871. Made-with: Cursor
Force-pushed from f4875f6 to 96a3e84.
/ok to test
Probe support in the fixture and skip when cuda.core is built without CUDA 13 im2col-wide support or when the driver/GPU reports CUDA_ERROR_INVALID_VALUE, so unsupported RTXPRO6000 lanes don't block unrelated changes. Made-with: Cursor
/ok to test
…safety. Expose only TensorMapDescriptor in cuda.core, add StridedMemoryView.as_tensor_map(), remove redundant tensor-map fallback packing, and track/check descriptor context/device compatibility before replacement and kernel launch argument packing. Made-with: Cursor
/ok to test
Bring back the cu12 feature blocks so pixi can parse the manifest and local test commands no longer fail early with a missing feature error. Made-with: Cursor
/ok to test
Reject CUDA device-local tensors from a different GPU while still allowing CUDA host and managed memory. Add regression tests for descriptor creation, replace_address, and the shared validation helper.
/ok to test
@leofang Please review. This is now blocked on your review.
# if __has_include(<dlpack/dlpack.h>)
#   include <dlpack/dlpack.h>
#   define CUDA_CORE_HAS_DLPACK_H 1
# else
#   define CUDA_CORE_HAS_DLPACK_H 0
# endif
Q: We guarantee having a dlpack.h at build time, but it's not accessible via <dlpack/dlpack.h> — so does that mean we end up with CUDA_CORE_HAS_DLPACK_H == 0?
Yes, this is generally good practice in C++ code so you can avoid having to check if a pre-processor macro is defined before checking its value.
#if defined(CUDA_CORE_HAS_DLPACK_H) && CUDA_CORE_HAS_DLPACK_H == 0
// ...
#endif
#if defined(__has_include)
# if __has_include(<cuda/tma>)
#   include <cuda/tma>
Q: I am confused -- The TMA header was added fairly recently. We build cuda.core against both CUDA 12 & 13 and merge the resulting wheels. Without vendoring the CCCL header, how did we manage to build and make CI green? 🤔
It's only available via CTK as of CUDA 13.2, meaning before this week it was not there.
https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/tma/make_tma_descriptor.html
The dependent code is only compiled in when the header is available: CUDA_CORE_HAS_CUDA_TMA.
vector.vector[void*]& data_addresses,
TensorMapDescriptor arg,
const size_t idx) except -1:
    arg._check_context_compat()
I am torn on this. I understand the rationale, but this call will be very slow since it involves multiple driver API calls.
We should check this when a TMA is constructed in Python, memoize the device/context attributes, and then move on. For example, we don't do a pointer-attribute check at launch time either; it adds just too much overhead.
# Allocate a temporary buffer for the 128-byte CUtensorMap struct.
# We copy rather than pointing directly at arg._tensor_map for lifetime
# safety: ParamHolder owns and frees its argument buffers independently.
cdef void* ptr = PyMem_Malloc(sizeof(cydriver.CUtensorMap))
if ptr is NULL:
    raise MemoryError("Failed to allocate memory for CUtensorMap")
memcpy(ptr, arg._get_data_ptr(), sizeof(cydriver.CUtensorMap))
This is unnecessary because the driver will copy again. Just pack the pointer and pass it to cuLaunchKernel, and the driver will copy it over.
// Initialise a single-phase mbarrier (1 arriving thread).
asm volatile(
    "mbarrier.init.shared.b64 [%0], 1;"
    :: "r"((unsigned)__cvta_generic_to_shared(&mbar)));

// Ask TMA to copy TILE_SIZE floats starting at element 'tile_start'
// from the tensor described by 'tensor_map' into shared memory.
asm volatile(
    "cp.async.bulk.tensor.1d.shared::cluster.global.tile"
    ".mbarrier::complete_tx::bytes"
    " [%0], [%1, {%2}], [%3];"
    :: "r"((unsigned)__cvta_generic_to_shared(smem)),
       "l"(&tensor_map),
       "r"(tile_start),
       "r"((unsigned)__cvta_generic_to_shared(&mbar)));

// Tell the mbarrier how many bytes the TMA will deliver.
asm volatile(
    "mbarrier.arrive.expect_tx.shared.b64 _, [%0], %1;"
    :: "r"((unsigned)__cvta_generic_to_shared(&mbar)),
       "r"((unsigned)(TILE_SIZE * sizeof(float))));

asm volatile(
    "{ .reg .pred P; \n"
    "WAIT: \n"
    " mbarrier.try_wait.parity.shared.b64 P, [%0], 0; \n"
    " @!P bra WAIT; \n"
    "} \n"
    :: "r"((unsigned)__cvta_generic_to_shared(&mbar)));
I now think this example should just be combined with tma_tensor_map.py, since we have lots of code repetition here.
Co-authored-by: Leo Fang <leo80042@gmail.com>
Summary
- `TensorMapDescriptor` Cython class wrapping the CUDA driver's `CUtensorMap` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
- `from_tiled()` and `from_im2col()` class methods, with automatic dtype inference, stride computation, and validation
- `TensorMapDescriptor` as a first-class kernel argument in `_kernel_arg_handler.pyx`
- Tests (`test_tensor_map.py`) and an example (`tma_tensor_map.py`)

Closes #199
Closes #200